\documentclass{article}
\usepackage{geometry} 
\usepackage{verbatim}
\geometry{letterpaper} 

\title{Specification for Redwood}
\author{Andy Balholm}

\begin{document}

\maketitle

\begin{abstract}

Redwood is an internet content-filtering program. 
It is designed to replace and improve on DansGuardian 
as the core of the Security Appliance internet filter. 
It adds flexibility and granularity to the filtering by classifying sites 
into multiple categories instead of just ``Allow'' and ``Block.'' 

\end{abstract}

\section{Basic Architecture}

Redwood runs as an HTTP proxy server.
It examines each HTTP message to determine if it should be allowed to proceed. 
If so, it passes the message on 
to its final destination. If not, it replaces the message with a customizable page 
stating that the request is not allowed (optionally giving the reason 
and providing a link for filing an overblock request).

Redwood's filtering is based on URLs and also, where applicable, on page content. 
(By default, content filtering is applied to files whose MIME type starts with \verb"text".)

\section{Configuration File}

By default, the main configuration file is located at \verb"/etc/redwood/redwood.conf". 
This path can be changed by using the \verb"-c" command line switch. 
Configuration options may be specified either in the configuration file or as command-line
switches. In the configuration file, they may be specified either as
\verb"key = value" or as \verb"key value". Comments are delimited with \verb"#". 
Values may be enclosed in double quotes, with the usual backslash escapes.
Additional configuration files may be included by using the \verb"include" directive.

An example configuration file:

\begin{verbatim}
# Listen for connections on port 8000.
http-proxy :8000

# the template for the block page
blockpage "/etc/redwood/block.html"

# directory of static files to be served by the internal web server
static-files-dir /etc/redwood/static

# directory of CGI scripts to run by the internal web server
cgi-bin /etc/redwood/cgi

# the directory containing the category information
categories /etc/redwood/categories

# the file containing the Access Control List configuration
acls /etc/redwood/acls.conf

# the minimum total score from a blocked category needed 
# to block a page
threshold 275

# file configuring the content pruning
content-pruning /etc/redwood/pruning.conf

# file configuring URL query modification
query-changes /etc/redwood/safesearch.conf

# path to the access log
access-log /var/log/redwood/access.log
\end{verbatim}

\section{Categories}

The configuration files for Redwood allow the user to establish any number of categories 
corresponding to the types of content that he wishes to block or to allow. 
As each HTTP message is processed, it is assigned a score in each category, 
based on the filter lists that are set up for that category. 
These scores are then used to determine whether the page should be blocked.

Each category is a assigned an action: allow, block, or ignore. 
A page will be blocked if the score for any category listed as \verb"block" 
is higher than the highest score for any category listed as \verb"allow". 
If a category is listed as \verb"ignore", its score does not affect whether a page 
is blocked or not. 
However, a page is not blocked unless the score for the highest \verb"block" category
is greater than a certain configurable threshold. This prevents overblocks 
of pages with almost no textual content.

The categories are stored in a directory whose location is specified in the configuration file. 
Each subdirectory of that directory defines a category (with the same name as the directory). 

Each category's directory contains a file named \verb"category.conf" and any number of rule-list files. 
A category named ``mechanical'' might have a \verb"category.conf" file like the following:

\begin{verbatim}
description: Auto Repair
action: allow
\end{verbatim}

This configuration would mean that the category's user-visible description would be 
``Auto Repair'' rather than ``mechanical,'' and that pages that fall into the category would be allowed. 
The description defaults to the category name, and the action 
defaults to \verb"ignore". Actions can be overriden for specific users by the use of filter groups.
A \verb"category.conf" file may also have the entry \verb"invisible: true"; 
this indicates that when a page is blocked because it belongs to that category,
the response will be an invisible image instead of the usual block page.

The rule-list files define the rules used to calculate the category's score. 
Each rule-list file must have an extension of \verb".list". 
(This rule ensures that files ending in \verb".bak", \verb".orig", etc. are ignored.) 
It is a plain-text file encoded in UTF-8. Comments are delimited with \verb"#". 
Here is an example of a rule-list file that might be in the directory for the 
``mechanical'' category mentioned earlier:

\begin{verbatim}
napaonline.com 200 # Give napaonline.com 200 points for this category.
www.napaonline.com/catalog/ 50 # bonus points for NAPA's catalog

default 150 # The following domains will each get 150 points.
carquest.com
autozone.com

/t[iy]re/ 75 # Any page with tire or tyre in the URL will get 75 points.
/parts/h 50 # A page with parts in the hostname will get 50 points

<grease gun> 25 # 25 points for each occurrence of "grease gun" in the content
<oil filter> 25 100 # 25 points for each occurrence, but no more than 100 total
\end{verbatim}

There are three kinds of filter rules:

\begin{description}

\item[URL matching] A URL matching rule consists of a domain name, optionally followed by a path. 
After it, separated by a space, is the weight---the number of points that get added to this category's score 
for sites that match the rule. 

A rule for a domain will also match subdomains: \verb"napaonline.com" also matches \verb"www.napaonline.com".
A rule with a path will also match longer paths: \verb"www.napaonline.com/catalog" also matches 
\verb"www.napaonline.com/catalog/result.aspx".

If a domain and a subdomain (or a path and a subdirectory) 
are both listed, the subdomain will effectively get the sum of the two weights. For example, if \verb"xerox.com"
were listed with 100 points, and \verb"support.xerox.com" were listed with 50 points, \verb"support.xerox.com" 
would actually get a score of 150 points.

\item[URL regular expressions] A regular expression to match the URL is listed between slashes. 
The points are added to the category score for each page whose URL matches the regular expression. 
The URL is converted to lower case before comparing it to the regular expressions.
The regular expression syntax is that supported by the RE2 library.

A regular expression can be restricted to matching a certain part of the URL by adding a one-character suffix
immediately after the final slash.
A suffix of \verb"h" matches the hostname (e.g. \verb"www.google.com"), 
\verb"d" matches the base domain name (e.g. \verb"google"),
\verb"p" matches the path,
and \verb"q" matches the query.

\item[Content phrases] Unlike the other two kinds of rules, these apply to the content of the page, not the URL. 
Phrases are enclosed between angle brackets. Before testing to see if a phrase matches, 
both the phrase and the page are simplified: capital letters are converted to lowercase, 
all characters that are not letters or digits are replaced by spaces,
and multiple spaces are replaced by single spaces. Then the phrase weight is added to the page's score for the category 
for each time the phrase is found on the page. But if the phrase has a second weight listed, 
no more than that amount will be added no matter how many times the phrase occurs. 
(In the example, if ``oil filter'' occurred more than four times, the additional occurrences wouldn't count.)

The content of the page is scanned for phrases only if phrase scanning is selected with the \verb"phrase-scan" ACL action.

\end{description}

There is also a \verb"default" rule. It specifies what weight will be assigned to rules that don't specify a weight. 
It applies to all rules without a specified weight between it and the next \verb"default" rule or the end of the file. 
If there is no \verb"default" rule, the default weight is zero.

Weights must be integers, but they may be negative. Negative weights can be used to offset short, general matches 
with long, more-specific ones, e.g.:

\begin{verbatim}
<grease> 10
<grease paint> -10
\end{verbatim}

If a page is blocked based on its URL (i.e. by URL matching and/or URL regular expressions), 
its content will not be evaluated because the page will not be downloaded.

\section{Access Control Lists (ACLs)}

Much of Redwood's functionality is configured with Access Control Lists (ACLs).
Each request is assigned a number of ACL tags, and then an action is chosen based on those tags.
For example:

\begin{verbatim}
acl no-web user-ip 192.168.1.25
block no-web
\end{verbatim}

The first line creates an ACL tag \verb"no-web", and assigns it to all requests coming from IP address 192.168.1.25.
The second line causes all requests with that tag to be blocked.

ACLs are checked at several points during the processing of a request:
before sending the request to the origin server, after receiving a response,
and after scanning the content for phrases. 
Each time, the request may have different ACL tags, since more information is available.
Each stage also has a different set of possible actions, 
although there is some overlap.
(The \verb"allow", \verb"block", and \verb"block-invisible" actions are always available.)

Any number of ACL files can be loaded with the \verb"acl" directive in the configuration file.
An ACL file can load other ACL files with a line that contains \verb"include" and the filename.

\subsection{Assigning ACL Tags}

ACL tags are assigned by lines starting with \verb"acl". 
These lines have the format:

\begin{verbatim}
acl tag-name attribute values
\end{verbatim}

The \verb"tag-name" can be any name that does not include spaces.
The \verb"attribute" refers to some property of the request or response (listed below).
The \verb"values" are a space-separated list; if any of them matches the attribute's value, the tag will be assigned.
If there is more than one \verb"acl" line with the same tag name, the tag will be applied if any of them matches (logical OR).
An ACL may have a description associated with it with a line like \verb"describe tag-name A long description".

In addition to the tags assigned by \verb"acl" lines, a request is assigned a tag for its highest-scoring category
(if the score is above the threshold).

The following attributes are available:

\begin{description}

	\item[content-type] (response only) The response's media type, usually taken from the Content-Type header.
		This can also be a generic type, with an asterisk after the slash:

		\begin{verbatim}
		acl images content-type image/*
		\end{verbatim}

	\item[method] The HTTP request method, such as \verb"GET" or \verb"POST".

	\item[referer] The request's Referer header. (This matches the same way as regular URL matching rules.)

	\item[time] The current time.

		\begin{verbatim}
		acl work-hours time MTWHF 9:00-17:00
		\end{verbatim}

		This attribute lets you select certain days of the week and/or ranges of times of the day.
		If the days of the week are specified, they must come first; they are abbreviated SMTWHFA.
		Any number of time ranges may be specified; the rule will match if the current time falls within any of them.
		Times must be in 24-hour format.

	\item[url] The URL requested. (This matches the same way as regular URL matching rules.)

	\item[user-ip] The user's IP address, or a range of addresses (in CIDR format, or with a dash).

		\begin{verbatim}
		acl managers 10.0.2.5 10.0.1.0/24 10.0.2.18-25
		\end{verbatim}

	\item[user-name] The username from HTTP proxy authentication.
	
\end{description}

\subsection{ACL Actions}

After the ACL tags are assigned, Redwood goes through the ACL files looking for an action to perform.
An action will be selected only if it has all the tags specified in the action line.
(And none of the negated tags; if a tag in an action line is preceded by an exclamation point, 
the request must not have that tag.)
Since it goes through the files in order, earlier action lines take precedence over later ones.
If it gets to the end of the file without finding a matching rule,
it will use the default action of the highest-scoring category.
If there is no category that scores over the threshold, the default action is \verb"allow".

\begin{description}

	\item[allow] Allow the request to proceed.

	\item[block] Respond with an HTTP status code of 403, and send the standard block page.

	\item[block-invisible] Respond with HTTP 403, and send an invisible 1-pixel image instead of a block page.

	\item[ignore-category] Drop the highest-scoring category off the list of categories,
		and go through the ACL files again.

	\item[phrase-scan] (response only) Run a phrase scan on the page content.
		Normally this will be configured to depend on the content type:

		\begin{verbatim}
		acl text content-type text/* application/xhtml+xml
		acl css content-type text/css
		phrase-scan text !css
		\end{verbatim}

	\item[require-auth] (request only) Send an HTTP 407 response if the request doesn't have a
		Proxy-Authorization header.

	\item[ssl-bump] (CONNECT requests only) Activate the SSLBump feature, to filter HTTPS connections.
		(Transparently intercepted HTTPS connections produce a virtual CONNECT request inside Redwood,
		so they can be filtered too.)

\end{description}

\section{URL Query Modification}

When processing an HTTP request, Redwood can modify the query parameters in the URL.
The configuration file for these changes is specified with the \verb"query-changes" keyword.
Each line contains a URL-matching or URL-regular-expression rule, 
followed by a query expression.
If the query in the URL already contains parameters with the same names as those specified in the file,
they will be replaced with the new values. Otherwise the new values will be added.

\begin{verbatim}
# Force safe search on several search engines.
/www\.google\.[^/]+/search/ safe=vss
search.lycos.com adv=1&adf=on
search.yahoo.com vm=r
/hotbot/h adf=on
www.metacrawler.com familyfilter=1
\end{verbatim}

\section{Content Pruning}

Between downloading a page and scanning its content for phrases, 
Redwood can perform ``content pruning.'' 
This is scanning the parsed HTML tree for elements matching certain criteria,
and deleting those elements and their children.

Content pruning is controlled by a configuration file.
Each line of the file contains a URL-matching or URL-regular-expression rule
to specify what site or page the pruning applies to, 
and a CSS selector to specify what elements to delete.
Between the two, there may be a threshold value. 
If a threshold is specified, the element is phrase-scanned before deleting.
It is deleted only if the score from the phrases found in a blocked category
is at least the threshold.

\begin{verbatim}
# Craigslist personals and discussion forums
craigslist.org div#ppp, div#forums, option[value=ppp]

# Bing ad sidebar
bing.com div.sb_adsNv2

# Delete questionable forum topics.
talk.newagtalk.com/forums 50 td.messagecellbody > ul
\end{verbatim}

\section{Block Pages}

When Redwood blocks access to a web page, 
it returns an HTTP response with a status of 404 Forbidden.
Unless the category that caused the page to be blocked is configured as \verb"invisible",
the body of the 404 response will be HTML rendered from a template file.
The template file is specified with the \verb"blockpage" configuration directive.
The following placeholders may be used in the template file, 
to be replaced by the appropriate information when the block page is sent:

\begin{description}

\item[\{\{.URL\}\}] the URL of the page that was blocked
\item[\{\{.Categories\}\}] the names of the categories that caused the page to be blocked
\item[\{\{.Conditions\}\}] the conditions of the ACL rule that caused the page to be blocked
\item[\{\{.User\}\}] the user's IP address or username
\item[\{\{.Tally\}\}] a list of the rules that matched, and how many times each one matched
\item[\{\{.Scores\}\}] a list of categories, and how many points the page scored in each category

\end{description}

The block page is generated using the Go template package; see
\verb"http://golang.org/pkg/text/template"
and
\verb"http://golang.org/pkg/html/template"
for documentation.

There is one custom function defined for the templates to use, \verb"eq", 
which tests its parameters for equality.

\section{Virtual Web Servers}

Since the block page may need to refer to external resources
(such as images, stylesheets, and scripts),
Redwood includes an internal web server.
This web server does not accept connections directly,
but whenever Redwood processes a request with a server address of 203.0.113.1,
it directs the request to the internal server instead of processing it normally.
The content of the internal web server is configured with the 
\verb"static-files-dir", and \verb"cgi-bin" directives.

If a more advanced virtual server is needed, you can use the \verb"virtual-host"
directive to transparently redirect requests for a given hostname to a different address,
such as an Apache web server running on your gateway.
If the server is running on your gateway, listening on port 8888, and you want it to be 
available as \verb"myserver.local", use \verb"virtual-host myserver.local localhost:8888".
(Note: proxy settings are not set on the client, and Redwood is intercepting
requests transparently, this will work only if the DNS server resolves the name
to an IP address outside your local network. 
Any IP address will do, though. OpenDNS's website-unavailable address works fine.
Also note that \verb"virtual-host" only works with HTTP, not with HTTPS.)

\section{Test Mode}

If Redwood is run with the \verb"-test" switch, it does not run as a proxy server. 
Instead, it evaluates the URL given as an argument after the switch.
It prints detailed debugging information about how the URL and its content would be rated
if that page were requested in normal operation: how many times each rule matches, 
what the score is in each category, which categories would block the page, etc.

\section{Log Files}

Redwood has several categories of messages that can be logged:

General diagnostic messages are sent to standard error by default,
and may be redirected to a file using normal shell redirection.

The access log has a line for each request processed.
It is in CSV format
and goes to standard output by default.
It can be sent to a file by including the \verb"access-log" directive in Redwood's configuration file.
The access log has the following fields: time, username or IP address, 
action (allow, block, or ignore), URL, HTTP method (GET, PUT, etc.),
HTTP response status (if an HTTP response was being processed), 
content type, content-length, whether the content was modified by Redwood, which rules matched (and how many times), 
the score for each category, and the list of categories that caused the page to be blocked (if it was).
The content length is meaningful only if a phrase scan was performed.

The TLS log has a line for each HTTPS connection that was intercepted.
Like the access log, it goes to standard output by default, 
and it can be sent to a file with the \verb"tls-log" directive.
The TLS log has the following fields:
time, username or client IP address, server name, server address, 
and any error that was encountered.

\section{Authentication}

Redwood can be configured (using the \verb"require-auth" ACL action) to require HTTP basic proxy authentication,
with a username and password.
The usernames and passwords can come from a file that is specified by the \verb"--password-file"
configuration directive. Each line in the file consists of a username,
a space or a tab, and a password.
Alternatively, a program can be specified to perform authentication with \verb"--auth-helper".
Each line of the program's input will be a username and a password, separated by a space.
It should respond with \verb"OK" if the password is correct, or \verb"ERR" if it is not.
One such program is \verb"basic_pam_auth", which is included with Squid.

\section{SSLBump}

Redwood can be configured (using the \verb"ssl-bump" ACL action) to perform Man-in-the-Middle filtering of HTTPS traffic.
This feature is called SSLBump after the corresponding feature in Squid.

For SSLBump to work, Redwood must be configured with a root certificate that is trusted
by the users' browsers.
Paths to the certificate and its private key are specified with the \verb"tls-cert" and \verb"tls-key"
options.
The certificate and key should be in PEM format.

Redwood uses the system root certificates to verify the identity of the sites it bumps.
Other trusted root certificates can be specified with the \verb"trusted-root" option.

\end{document}

